Google Web 1T 5-Grams Made Easy (but not for the computer)

Author

  • Stefan Evert
Abstract

This paper introduces Web1T5-Easy, a simple indexing solution that allows interactive searches of the Web 1T 5-gram database and of a derived database of quasi-collocations. The latter is validated against co-occurrence data from the BNC and ukWaC on the task of automatically identifying non-compositional VPCs (verb-particle constructions).
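As a rough illustration of the kind of interactive lookup such an index enables, the Python sketch below runs a wildcard query against a hypothetical SQLite table of 5-grams. The database file name, table name (ngrams) and column layout (w1..w5, freq) are assumptions made for this example only, not the actual Web1T5-Easy schema.

    # Hedged sketch: wildcard lookup against an assumed SQLite 5-gram table.
    import sqlite3

    def lookup(db_path, w1=None, w2=None, w3=None, w4=None, w5=None, limit=20):
        """Return matching 5-grams and their counts; None acts as a wildcard."""
        con = sqlite3.connect(db_path)
        cols = ["w1", "w2", "w3", "w4", "w5"]
        words = [w1, w2, w3, w4, w5]
        conditions = [f"{c} = ?" for c, w in zip(cols, words) if w is not None]
        params = [w for w in words if w is not None]
        where = " AND ".join(conditions) or "1"
        sql = (f"SELECT w1, w2, w3, w4, w5, freq FROM ngrams "
               f"WHERE {where} ORDER BY freq DESC LIMIT ?")
        rows = con.execute(sql, params + [limit]).fetchall()
        con.close()
        return rows

    # e.g. all indexed 5-grams whose second word is 'web', most frequent first
    # print(lookup("web1t5.sqlite", w2="web"))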


Similar papers

Analysis of Czech Web 1T 5-Gram Corpus and Its Comparison with Czech National Corpus Data

In this paper, the newly issued Czech Web 1T 5-gram corpus created by Google and the LDC is analysed and compared with a reference n-gram corpus obtained from the Czech National Corpus. The original 5-grams from both corpora were post-processed, and statistical trigram language models with various vocabulary sizes and parameters were created. The comparison of various corpus statistics such as unique and total wor...
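As a reminder of what the compared trigram models amount to, here is a toy sketch of maximum-likelihood trigram estimation from token counts; the miniature corpus is invented for illustration and has nothing to do with the Czech data.

    # Toy sketch: relative-frequency (MLE) trigram probabilities and the
    # unique vs. total word counts typically reported in such comparisons.
    from collections import Counter

    def trigram_mle(tokens):
        """Return P(w3 | w1 w2) estimated by relative frequency."""
        trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {tg: n / bigrams[tg[:2]] for tg, n in trigrams.items()}

    tokens = "the cat sat on the mat the cat sat on the hat".split()
    probs = trigram_mle(tokens)
    print(probs[("cat", "sat", "on")])    # 1.0 in this toy corpus
    print(len(set(tokens)), len(tokens))  # unique vs. total word counts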


Real-Word Spelling Correction using Google Web 1T 3-grams

We present a method for detecting and correcting multiple real-word spelling errors using the Google Web 1T 3-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the detection recall (the fraction of errors correctly detected) and the correction recall (the fraction of errors correc...
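The normalized LCS similarity at the core of this approach can be sketched in a few lines; the normalization used below (LCS length divided by the length of the longer string) is a common variant chosen for illustration and may differ from the modified version described in the paper.

    # Sketch: normalized Longest Common Subsequence similarity between
    # a suspect word and a candidate correction.
    def lcs_length(a, b):
        """Classic dynamic-programming LCS length."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a, 1):
            for j, cb in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def normalized_lcs(a, b):
        return lcs_length(a, b) / max(len(a), len(b)) if a and b else 0.0

    print(normalized_lcs("piece", "peace"))  # 0.8: a plausible real-word confusion pair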


Unsupervised Approaches to Text Correction Using Google N-grams for English and Romanian

We present an unsupervised approach that can be applied to text correction tasks such as real-word error correction, near-synonym choice, and preposition choice, using n-grams from the Google Web 1T dataset. We describe in detail the method for correcting preposition errors, which has two phases. We categorize the n-gram types based on the position of the gap that needs to be replaced with a p...
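The gap-filling idea can be illustrated with a toy scorer that ranks candidate prepositions by the frequency of the n-grams obtained when each candidate fills the gap. The TOY_COUNTS dictionary and the candidate list are placeholders standing in for real Web 1T counts, not part of the method described in the paper.

    # Sketch: choose the preposition whose filled-in n-gram is most frequent.
    TOY_COUNTS = {
        ("interested", "in", "learning"): 120000,
        ("interested", "on", "learning"): 900,
        ("interested", "at", "learning"): 300,
    }

    def get_count(ngram):
        # a real system would query the Google Web 1T counts here
        return TOY_COUNTS.get(ngram, 0)

    def best_preposition(left, right, candidates=("in", "on", "at", "for", "to")):
        scores = {p: get_count((left, p, right)) for p in candidates}
        return max(scores, key=scores.get), scores

    print(best_preposition("interested", "learning"))  # ('in', {...})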


Ngram Search Engine

In this paper, we describe the design and implementation of an ngram search engine for very large sets of ngrams. The engine supports queries with an arbitrary number of wildcards, answers a search in a fraction of a second, and can report the fillers of the wildcards. We implemented the system using two datasets. One is the 1 billion 5-grams provided by Google (Web 1T data), the othe...
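The query semantics (wildcard positions whose fillers are reported) can be shown with a naive linear scan; this deliberately ignores the indexing that makes the real engine fast, and the sample 5-grams and counts below are invented.

    # Sketch: match a 5-token pattern with '*' wildcards and report the fillers.
    def search(pattern, ngrams):
        """pattern: list of 5 tokens, '*' = wildcard.  Yields (ngram, fillers, freq)."""
        for ngram, freq in ngrams:
            if all(p == "*" or p == w for p, w in zip(pattern, ngram)):
                fillers = tuple(w for p, w in zip(pattern, ngram) if p == "*")
                yield ngram, fillers, freq

    data = [(("a", "piece", "of", "the", "puzzle"), 41000),
            (("a", "piece", "of", "the", "action"), 38000)]
    for hit in search(["a", "piece", "of", "*", "*"], data):
        print(hit)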


Minimal Perfect Hash Rank: Compact Storage of Large N-gram Language Models

In this paper we propose a new method of compactly storing n-gram language models called Minimal Perfect Hash Rank (MPHR) that uses significantly less space than all known approaches. It requires O(n) construction time and allows for O(1) random access of probability values or frequency counts associated with n-grams. We make use of minimal perfect hashing to store fingerprints of n-grams in an...
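The storage idea can be sketched conceptually: each n-gram is mapped to a slot, a small fingerprint guards against false matches, and the count is stored as a rank into a table of distinct count values. The sketch below uses an ordinary hash modulo a slot count in place of a true minimal perfect hash, so it demonstrates the lookup logic only, not the space savings of the actual MPHR scheme.

    # Conceptual sketch of fingerprint + count-rank storage for n-gram counts.
    import hashlib

    def _hash(ngram, salt):
        h = hashlib.md5((salt + " " + " ".join(ngram)).encode("utf-8")).digest()
        return int.from_bytes(h[:8], "big")

    def build(counts, num_slots):
        values = sorted(set(counts.values()))        # table of distinct count values
        rank = {v: i for i, v in enumerate(values)}
        slots = [None] * num_slots
        for ngram, c in counts.items():
            slot = _hash(ngram, "slot") % num_slots  # stand-in for a minimal perfect hash
            if slots[slot] is not None:
                raise ValueError("slot collision; a true minimal perfect hash avoids this")
            fp = _hash(ngram, "fp") % (1 << 16)      # 16-bit fingerprint
            slots[slot] = (fp, rank[c])
        return slots, values

    def lookup(slots, values, ngram):
        slot = _hash(ngram, "slot") % len(slots)
        entry = slots[slot]
        if entry is None or entry[0] != _hash(ngram, "fp") % (1 << 16):
            return None                              # unseen n-gram (small false-positive rate)
        return values[entry[1]]

    slots, values = build({("new", "york", "city"): 5000000,
                           ("san", "francisco", "bay"): 800000}, num_slots=1024)
    print(lookup(slots, values, ("new", "york", "city")))  # stored count
    print(lookup(slots, values, ("not", "in", "data")))    # almost certainly None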




Publication date: 2010